CLEAN VERSION SUBSTITUTE SPECIFICATION



EFFICIENT PROCESS FOR HANDOVER BETWEEN SUBNET MANAGERS 

RELATED APPLICATIONS 
[0001] The present invention is related to the subject matter of the following commonly
assigned, co-pending United States Patent Applications filed concurrently herewith: Serial No.
09/692,342 (Docket No. AUS9-2000-0620) entitled "Method and System for Informing An
Operating System In A System Area Network When A New Device Is Connected"; Serial
No. 09/692,347 (Docket No. AUS9-2000-0622) entitled "Method and System For Scalably
Selecting Unique Transaction Identifiers"; Serial No. 09/692,349 (Docket No. AUS9-2000-0623)
entitled "Method And System For Reliably Defining and Determining Timeout Values In
Unreliable Datagrams"; and Serial No. 09/692,353 (Docket No. AUS9-2000-0624) entitled
"Method and System For Choosing A Queue Protection Key That is Tamper-proof From An
Application". The content of the above-referenced applications is incorporated herein by
reference.

BACKGROUND OF THE INVENTION

Technical Field: 

[0002] The present invention relates in general to computer networks and, in particular, to
merging of independent computer networks. Still more particularly, the present invention relates
to a method and system for providing an efficient handover of control between subnet managers
of separate subnets, which are being merged into a single subnet.



Description of the Related Art:

[0003] The use of I/O interconnects to connect components of a distributed computer system is
known in the art. Traditionally, in such systems, individual components are interconnected via a
parallel bus, such as a PCIX bus. The parallel bus has a relatively small number of plug-in ports
for connecting the components. The number of plug-in ports is set (i.e., the number cannot be
increased). At maximum loading, a PCIX bus transmits data at about 1 Gbyte/second.

[0004] The introduction of high performance adapters (e.g., SCSI adapters), Internet-based
networks, and other high performance network components has resulted in increased demand for
bandwidth, faster network connections, distributed processing functionality, and scaling with
processor performance. These and other demands are quickly outpacing the current parallel bus
technology and are making the limitations of parallel buses even more visible. A PCIX bus, for
example, is not scalable, i.e., the length of the bus and number of slots available at a given
frequency cannot be expanded to meet the needs for more components, and the limitation hinders
further development of fast, efficient distributed networks, such as system area networks. New
switched network topologies and systems capable of being easily expanded are required to keep
up with the increasing demands, while allowing the network processes on the expanding network
to be dynamically completed, i.e., without manual input.

[0005] The present invention recognizes the need for faster, more efficient computer
interconnects offering the features demanded by the developments of technology. More
specifically, the present invention recognizes the need for providing a mechanism within a
network such as a System Area Network (SAN) consisting of multiple subnets, that provides
efficient, dynamic combining of two or more subnets into a single network.

SUMMARY OF THE INVENTION
[0006] A method and system is disclosed for efficiently combining (or merging) subnets having
individual master subnet managers into a single network with one master subnet manager. The
invention is thus applicable to a distributed computing system, such as a system area network,
having end nodes, switches, and routers, and links interconnecting these components. The
switches and routers interconnect the end nodes and route packets, i.e., sub-components of
messages being transmitted, from a source end node to a target end node. The target end node
then reassembles the packets into the message.

[0007] During discovery and configuration of a subnet, a subnet manager creates a Subnet
Management Database (SMDB) representative of the subnet components being managed. Each
subnet manager contains an independent SMDB. When two or more subnets are merged (i.e.,
linked/connected together) to form a single network, a single one of the subnet managers is
selected as the master subnet manager and all the subnets' SMDBs must be merged. The other
subnet managers are relegated to standby status. In a preferred embodiment, a SMDB record
labeling mechanism is utilized to differentiate among components from the different subnets that
may have the same parameter values, such as protection keys (P_Keys).

[0008] In one embodiment of the invention, when there are two or more separate SMDBs, each
maintained by separate master subnet managers on a separate subnet, and those separate subnets
are linked together (or merged) into one larger subnet, a process is provided for efficiently
merging the SMDBs. In this manner the separate subnets become a single subnet, with a single
master subnet manager and a single SMDB.
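
As an illustration of the merge summarized above, the following Python sketch combines two
SMDBs under a surviving master and labels each record with its subnet of origin, so that
components sharing a P_Key value remain distinguishable. The record layout, the names
SmdbRecord, merge_smdbs and p_key_overlaps, and the use of a string label are assumptions made
for illustration only; the specification does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SmdbRecord:
    """One hypothetical SMDB entry for a managed component."""
    guid: str    # globally unique identifier of the component
    p_key: int   # key value; may collide across subnets
    origin: str  # label naming the subnet the record came from


def merge_smdbs(master: list[SmdbRecord], absorbed: list[SmdbRecord]) -> list[SmdbRecord]:
    """Return a single SMDB holding the master's records plus the absorbed ones."""
    merged = {rec.guid: rec for rec in master}
    for rec in absorbed:
        merged.setdefault(rec.guid, rec)  # the master's record wins on a GUID clash
    return list(merged.values())


def p_key_overlaps(smdb: list[SmdbRecord]) -> dict[int, set[str]]:
    """Map each P_Key used by more than one subnet to the subnets using it."""
    users: dict[int, set[str]] = {}
    for rec in smdb:
        users.setdefault(rec.p_key, set()).add(rec.origin)
    return {key: origins for key, origins in users.items() if len(origins) > 1}


subnet_a = [SmdbRecord("guid-1", 0x8001, "A"), SmdbRecord("guid-2", 0x8002, "A")]
subnet_b = [SmdbRecord("guid-3", 0x8001, "B")]
single_smdb = merge_smdbs(subnet_a, subnet_b)
overlaps = p_key_overlaps(single_smdb)
```

Because every record keeps its origin label, the overlap report can list which subnets share a
key value without any record being overwritten.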

[0009] All objects, features, and advantages of the present invention will become apparent in the
following detailed written description.

BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The novel features believed characteristic of the invention are set forth in the appended
claims. The invention itself, however, as well as a preferred mode of use, further objects and
advantages thereof, will best be understood by reference to the following detailed description of
an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

[0011] Figure 1A depicts a system area network (SAN) in which the present invention is
preferably implemented;

[0012] Figure 1B is a diagram illustrating the software aspects of the SAN management model in
accordance with the present invention;

[0013] Figure 2 is a diagram illustrating software aspects of an exemplary host processor end
node for the SAN of Figure 1 in accordance with the present invention;

[0014] Figure 3 is a diagram of an exemplary host channel adapter of the SAN of Figure 1 in 
accordance with the present invention; 

[0015] Figure 4 is a diagram illustrating processing of work requests with queues in accordance
with a preferred embodiment of the present invention;

[0016] Figure 5 is an illustration of a data packet in accordance with a preferred embodiment of
the present invention;

[0017] Figure 6 is a diagram illustrating a communication over a portion of a SAN fabric;

[0018] Figure 7 is a diagram illustrating packet transfers in accordance with the invention;

[0019] Figure 8 is a diagram illustrating two independent networks, which are linked together to
form a single network in accordance with a preferred embodiment of the invention;

[0020] Figure 9A is a flow chart showing the process of time stamping SMDB entries in
accordance with a preferred embodiment of the invention; and

[0021] Figure 9B is a flow chart showing the process of absorbing an SMDB by a master subnet
manager in accordance with a preferred embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
[0022] The present invention is directed to a method for efficient handover of control to a single 
master subnet manager when two or more subnets having subnet managers are merged. The 
invention is applicable to a distributed computer system, such as a system area network (SAN).
The invention is implemented in a manner that allows the overlaps in merged subnet parameters, 
such as P_Keys, to be resolved. 

[0023] In order to appreciate the environment within which the invention is preferably practiced,
a description of a SAN configured with routers, switches, and end nodes, etc. is provided below.
Presentation of the environment and particular functional aspects of the environment which
enable the invention to be practiced are provided with reference to Figures 1-5. Section
headings have been provided to distinguish the hardware and software architecture of the SAN.
However, those skilled in the art understand that the description of either architecture
necessarily includes references to both components.

SAN HARDWARE ARCHITECTURE

[0024] With reference now to the figures and in particular with reference to Figure 1A, there is
illustrated an exemplary embodiment of a distributed computer system. Distributed computer
system 100 represented in Figure 1A is provided merely for illustrative purposes, and the
embodiments of the present invention described below can be implemented on computer systems
of numerous other types and configurations. For example, computer systems implementing the
present invention may range from a small server with one processor and a few input/output (I/O)
adapters to very large parallel supercomputer systems with hundreds or thousands of processors
and thousands of I/O adapters. Furthermore, the present invention can be implemented in an
infrastructure of remote computer systems connected by an Internet or intranet.

[0025] As shown in Figure 1A, distributed computer system 100 provides a system area network
(SAN) 113, which is a high-bandwidth, low-latency network interconnecting nodes within the
distributed computer system. More than one (1) SAN 113 may be included in a distributed
computer system 100 and each SAN 113 may comprise multiple sub-networks (subnets).

[0026] A node is herein defined to be any component that is attached to one or more links of a
network. In the illustrated distributed computer system, nodes include host processors 101,
redundant array of independent disks (RAID) subsystem 103, I/O adapters 105, switches 109A-109C,
and router 111. The nodes illustrated in Figure 1A are for illustrative purposes only, as
SAN 113 can connect any number and any type of independent nodes. Any one of the nodes can
function as an end node, which is herein defined to be a device that originates or finally
consumes messages or frames in the distributed computer system 100.

[0027] SAN 113 is the communications and management infrastructure supporting both I/O and
inter-processor communications (IPC) within distributed computer system 100. Distributed
computer system 100, illustrated in Figure 1A, includes switched communications fabric (i.e.,
links, switches and routers) allowing many devices to concurrently transfer data with high
bandwidth and low latency in a secure, remotely managed environment. End nodes can
communicate over multiple ports and utilize multiple paths through SAN 113. The availability
of multiple ports and paths through SAN 113 can be employed for fault tolerance and
increased-bandwidth data transfers.

[0028] SAN 113 includes switches 109A-109C and routers 111. Each of switches 109A-109C connects
multiple links together and allows routing of packets from one link to another link within SAN
113 using a small header Destination Local Identifier (DLID) field. Router 111 is capable of
routing frames from one link in a first subnet to another link in a second subnet using a large
header Destination Globally Unique Identifier (DGUID). Router 111 may be coupled via wide
area network (WAN) and/or local area network (LAN) connections to other hosts or other
routers.
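
The difference between the two forwarding decisions can be sketched in Python as follows.
This is only an illustrative model: the table contents, the dictionary-based packet, and the
function names switch_forward and router_forward are assumptions, not structures defined by
the specification.

```python
# Hypothetical forwarding tables; the LIDs, GUIDs, and port numbers are made up.
SWITCH_TABLE = {0x0011: 1, 0x0012: 2}            # DLID -> output port within a subnet
ROUTER_TABLE = {"guid-remote-host": "subnet-2"}  # DGUID -> next subnet


def switch_forward(packet: dict) -> int:
    """A switch consults only the small-header DLID field."""
    return SWITCH_TABLE[packet["dlid"]]


def router_forward(packet: dict) -> str:
    """A router consults the large-header DGUID field to cross subnets."""
    return ROUTER_TABLE[packet["dguid"]]


packet = {"dlid": 0x0012, "dguid": "guid-remote-host", "payload": b"data"}
out_port = switch_forward(packet)
next_subnet = router_forward(packet)
```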

[0029] In SAN 113, host processor nodes 101 and I/O nodes 106 include at least one Channel
Adapter (CA) to interface to SAN 113. Host processor nodes 101 include central processing
units (CPUs) 119 and memory 121. In one embodiment, each CA is an endpoint that implements
the CA interface in sufficient detail to source or sink packets transmitted on SAN 113. As
illustrated, there are two CA types, Host CA (HCA) 117 and Target CA (TCA) 127. HCA 117 is
used by general purpose computing nodes to access SAN 113. In one implementation, HCA 117
is implemented in hardware. In the hardware implementation of HCA 117, HCA hardware
offloads much of CPU and I/O adapter communication overhead. The hardware implementation
of HCA 117 also permits multiple concurrent communications over a switched network without
the traditional overhead associated with communicating protocols. Use of HCAs 117 in SAN
113 also provides the I/O and IPC consumers of distributed computer system 100 with zero
processor-copy data transfers without involving the operating system kernel process. HCA 117
and other hardware of SAN 113 provide reliable, fault tolerant communications.

[0030] The I/O chassis includes I/O adapter backplane 106 and multiple I/O adapter nodes 105
that contain adapter cards. Exemplary adapter cards illustrated in Figure 1A include SCSI
adapter card 123A, adapter card 123B to fiber channel hub and FC-AL devices, Ethernet adapter
card 123C, graphics adapter card 123D, and video adapter card 123E. Any known type of
adapter card can be implemented. The I/O chassis also includes switch 109B in the I/O adapter
backplane 106 to couple adapter cards 123A-123E to SAN 113.

[0031] RAID subsystem 103 includes a microprocessor 125, memory 126, a Target Channel
Adapter (TCA) 127, and multiple redundant and/or striped storage disks 129.

[0032] In the illustrated SAN 113, each link 115 is a full duplex channel between any two
network elements, such as end nodes, switches 109A-109C, or routers 111. Suitable links 115 may
include, but are not limited to, copper cables, optical cables, and printed circuit copper traces on
backplanes and printed circuit boards. The combination of links 115 and switches 109A-109C,
etc. operates to provide point-to-point communication between nodes of SAN 113.

SAN SOFTWARE ARCHITECTURE 
Software Components 

[0033] Software and hardware aspects of an exemplary host processor node 101 are generally
illustrated in Figure 2. An application programming interface (API) and other software
components, such as an operating system (OS) and device drivers, are utilized to allow software
components 220 to control hardware components 221. However, the feature of software-hardware
interaction at the processor node 101 is not essential to the discussion of the invention
and further description is not required for the understanding of the present invention. The
functionality provided by the API, OS and other components collectively provides verbs
interface 207. Verbs interface 207 operates as a conduit which separates the upper software
components 220 from the lower hardware components 221.

[0034] Host processor node 101 executes a set of consumer processes 201. Host processor node
101 includes HCA 117 with ports 205. Each port 205 connects to a link (see link 115 of Figure
1A). Ports 205 can connect to one SAN subnet or multiple SAN subnets. Utilizing message and
data services 203, consumer processes 201 transfer messages to SAN 113 via verbs interface
207. Verbs interface 207 is generally implemented with an OS-specific programming interface.

[0035] A software model of HCA 117 is illustrated in Figure 3. HCA 117 includes a set of
queue pairs (QPs) 301, which transfer messages across ports 205 to the subnet. A single HCA
117 may support thousands of QPs 301. By contrast, TCA 127 (Figure 1A) in an I/O adapter
typically supports a much smaller number of QPs 301. Figure 3 also illustrates subnet
management administration (SMA) 209, management packets 211 and a number of virtual lanes
213, which connect the transport layer with ports 205.

[0036] Turning now to Figure 1B, there is illustrated a software management model for nodes on
SAN 113. SAN architecture management facilities provide a Subnet Manager (SM) 303A, a
Subnet Administration (SA) 303B, and an infrastructure that supports a number of general
management services. The management infrastructure requires a Subnet Management Agent
(SMA) 307 operating in each node 305 and defines a general service interface that allows
additional general services agents. Also, SAN architecture defines a common management
datagram (MAD) message structure for communicating between managers and management
agents. SM 303A is responsible for initializing, configuring and managing switches, routers, and
channel adapters. SM 303A can be implemented within other devices, such as a channel adapter
or a switch. One SM 303A of SAN is dedicated as a master SM and is responsible for: (1)
discovering the subnet topology; (2) configuring each channel adapter port with a range of Local
Identification (LID) numbers, Global Identification (GID) number, subnet prefix, and Partition
Keys (P_Keys); (3) configuring each switch with a LID, the subnet prefix, and with its
forwarding database; and (4) maintaining the end node and service databases for the subnet to
provide a Global Unique Identification (GUID) number to LID/GID resolution service as well as
a services directory.

[0037] In a preferred embodiment, SM 303A records all configuration information for each
component of the network in a SM database (SMDB). Thus, management of SAN 113 and SAN
components, such as HCAs 117, TCAs (or end nodes) 127, switches 109, and routers 111, is
completed utilizing Subnet Management (SM) 303A and Subnet Administration (SA) 303B.
Subnet management packets (SMPs) are used to discover, initialize, configure, and maintain SAN
components through management agents 307 of end nodes 305. SAN SA packets are used by SAN
components to query and update subnet management data. Control of some aspects of the subnet
management is provided via a user management console 311 in host-based end node 309.

MESSAGE TRANSFER PROCESS 

[0038] SAN 113 provides the high bandwidth and scalability required for I/O and also supports
the extremely low latency and low CPU overhead required for Interprocessor Communications
(IPC). User processes can bypass the operating system (OS) kernel process and directly access
network communication hardware, such as HCAs 117, which enable efficient message passing
protocols. SAN 113 is suited to current computing models and is a building block for new forms
of I/O and computer cluster communication. SAN 113 allows I/O adapter nodes 105 to
communicate among themselves or communicate with any or all of the processor nodes 101 in
the distributed computer system. With an I/O adapter attached to SAN 113, the resulting I/O
adapter node 105 has substantially the same communication capability as any processor node
101 in the distributed computer system.

[0039] For reliable service types of messages, end nodes, such as host processor nodes 101 and
I/O adapter nodes 105, generate request packets and receive acknowledgment packets. Switches
109A-109C and routers 111 pass packets along from the source to the target (or destination).
Except for the variant cyclic redundancy check (CRC) trailer field, which is updated at each
transfer stage in the network, switches 109A-109C pass the packets along unmodified. Routers
111 update the variant CRC trailer field and modify other fields in the header as the packet is
routed.

[0040] In SAN 113, the hardware provides a message passing mechanism that can be used for
Input/Output (I/O) devices and Interprocess Communications (IPC) between general computing
nodes. Consumers (i.e., processing devices connected to end nodes) access SAN message
passing hardware by posting send/receive messages to send/receive work queues (WQ),
respectively, on a SAN Channel Adapter (CA).

[0041] A message is herein defined to be an application-defined unit of data exchange, which is
a primitive unit of communication between cooperating processes. A packet (or frame) is herein
defined to be one unit of data encapsulated by networking protocol headers. The headers
generally provide control and routing information for directing the packet (or frame) through
SAN 113. The trailer generally contains control and CRC data for ensuring that frames are not
delivered with corrupted content.

[0042] Consumers use SAN verbs to access HCA functions. The software that interprets verbs
and directly accesses the CA is known as the Channel Interface (CI) 219. Send/Receive work
queues (WQ) are assigned to a consumer as a Queue Pair (QP). Messages may be sent over five
different transport types: Reliable Connected (RC), Reliable Datagram (RD), Unreliable
Connected (UC), Unreliable Datagram (UD), and Raw Datagram (RawD). Consumers retrieve
the results of these messages from a Completion Queue (CQ) through SAN send and receive
work completions (WC). The source CA takes care of segmenting outbound messages and
sending them to the destination. The destination or target CA takes care of reassembling
inbound messages and placing them in the memory space designated by the destination's
consumer. These features are illustrated in the figures below.
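
The division of labor between the source and destination CAs can be sketched in Python as
follows. The function names segment and reassemble, the fixed payload size, and the assumption
that payloads arrive in order are illustrative choices, not details taken from the
specification.

```python
def segment(message: bytes, payload_size: int) -> list[bytes]:
    """Source CA: split an outbound message into packet payloads."""
    return [message[i:i + payload_size] for i in range(0, len(message), payload_size)]


def reassemble(payloads: list[bytes]) -> bytes:
    """Destination CA: rebuild the message from payloads received in order."""
    return b"".join(payloads)


message = b"application-defined unit of data"
payloads = segment(message, payload_size=8)
received = reassemble(payloads)
```

The consumer on either side sees only the whole message; the packet boundaries are handled
entirely by the channel adapters.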

[0043] Figures 4, 5, 6, and 7 together illustrate example request and acknowledgment
transactions. Referring now to Figure 4, there is illustrated a block diagram of work and
completion queue processing. In Figure 4, a receive work queue (RWQ) 400, send work queue
(SWQ) 402, and completion queue 404 are present for processing requests from and for
consumer 406. These requests from consumer 406 are eventually sent to hardware 408. In this
example, consumer 406 generates work requests 410 and 412 and receives work completion 414.

[0044] Each QP 301 provides an input to a Send Work Queue (SWQ) 402 and a Receive Work
Queue (RWQ) 400. SWQ 402 sends channel and memory semantic messages, and RWQ 400
receives channel semantic messages. A consumer calls a verb (within verbs interface 207) to
place work requests (WRs) into a work queue (WQ).

[0045] A send work request WR 412 is a channel semantic operation to push a set of local data
segments to the data segments referenced by a remote node's receive work queue element
(WQE). For example, work queue element 428 contains references to data segment 4 438, data
segment 5 440, and data segment 6 442. Each of the data segments of the send WR contains a
virtually contiguous memory region. The virtual addresses used to reference the local data
segments are in the address context of the process that created the local QP 301.

[0046] As shown in Figure 4, work requests placed onto a work queue are referred to as work
queue elements (WQEs). Send work queue 402 contains work queue elements (WQEs) 422-428,
describing data to be transmitted on the SAN fabric. Receive work queue 400 contains work
queue elements (WQEs) 416-420, describing where to place incoming channel semantic data
from the SAN fabric. WQEs are executed (processed) by hardware 408 in the HCA. SWQ 407
contains WQEs 405 that describe data to be transmitted on the SAN fabric. RWQ 409 contains
WQEs 405 that describe where to place incoming channel semantic data received from SAN
113.

[0047] A remote direct memory access (RDMA) read work request provides a memory semantic
operation to read a virtually contiguous memory space on a remote node. A memory space can
either be a portion of a memory region or portion of a memory window. A memory region
references a previously registered set of virtually contiguous memory addresses defined by a
virtual address and length. A memory window references a set of virtually contiguous memory
addresses which have been bound to a previously registered region.

[0048] The RDMA Read work request reads a virtually contiguous memory space on a remote
end node and writes the data to a virtually contiguous local memory space. Similar to the send
work request, virtual addresses used by the RDMA Read work queue element to reference the
local data segments are in the address context of the process that created the local queue pair.
For example, work queue element 416 in receive work queue 400 references data segment 1 444,
data segment 2 446, and data segment 3 448. The remote virtual addresses are in the address
context of the process owning the remote queue pair targeted by the RDMA Read work queue
element.

[0049] In one embodiment, Receive Work Queues 400 only support one type of WQE, which is
referred to as a receive WQE 416-420. The receive WQE 416-420 provides a channel semantic
operation describing a local memory space into which incoming send messages are written. The
receive WQE includes a scatter list describing several virtually contiguous memory spaces. An
incoming send message is written to these memory spaces. The virtual addresses are in the
address context of the process that created the local QP 301.

[0050] The verbs interface 207 also provides a mechanism for retrieving completed work from
completion queue 404. Completion queue 404 contains Completion Queue Elements (CQEs)
430-436, which contain information about previously completed WQEs. CQEs 430-436 are
employed to create a single point of completion notification for multiple QPs 301. A CQE
contains sufficient information to determine the QP 301 and specific WQE that completed. A
completion queue context is a block of information that contains pointers to, length, and other
information needed to manage the individual completion queues 404.
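
The relationship between queue pairs, work queue elements, and a shared completion queue can
be modeled with a short Python sketch. The QueuePair class, the integer WQE identifiers, and
the functions post_send and process_one are hypothetical names used for illustration; they do
not represent the verbs interface itself.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """Hypothetical QP: a send work queue and a receive work queue."""
    qp_num: int
    send_wq: deque = field(default_factory=deque)
    recv_wq: deque = field(default_factory=deque)


completion_queue: deque = deque()  # one CQ shared by several QPs


def post_send(qp: QueuePair, wqe_id: int) -> None:
    """Consumer side: a posted work request becomes a WQE on the send queue."""
    qp.send_wq.append(wqe_id)


def process_one(qp: QueuePair) -> None:
    """Hardware side: execute the oldest send WQE and report its completion."""
    wqe_id = qp.send_wq.popleft()
    completion_queue.append((qp.qp_num, wqe_id))  # the CQE names the QP and the WQE


qp_a, qp_b = QueuePair(1), QueuePair(2)
post_send(qp_a, 422)
post_send(qp_b, 424)
process_one(qp_b)
process_one(qp_a)
completions = list(completion_queue)
```

Each completion entry carries both the QP number and the WQE identifier, which is what lets a
single completion queue serve as the notification point for multiple QPs.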

[0051] Turning next to Figure 5, an illustration of a data packet is depicted in accordance with a
preferred embodiment of the present invention. Message data 500 contains data segment 1 502,
data segment 2 504, and data segment 3 506, which are similar to the data segments illustrated in
Figure 4. In this example, these data segments form a packet 508, which is placed into packet
payload 510 within data packet 512. Additionally, data packet 512 contains CRC 514, which is
used for error checking. Additionally, routing header 516 and transport header 518 are present in
data packet 512. Routing header 516 is used to identify source and destination ports for data
packet 512. Transport header 518 in this example specifies the destination queue pair for data
packet 512. Additionally, transport header 518 also provides information such as the operation
code, packet sequence number, and partition for data packet 512.

[0052] The operation code identifies whether the packet is the first, last, intermediate, or only
packet of a message. The operation code also specifies whether the operation is a send, RDMA
write, read, or atomic. The packet sequence number is initialized when communications is
established and increments each time a queue pair creates a new packet. Ports of an end node
may be configured to be members of one or more possibly overlapping sets called partitions.
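
The packet position and sequence numbering described in this paragraph can be sketched in
Python as follows. The function name label_packets, the string position labels, and the
starting sequence numbers are assumptions chosen for illustration; the sketch does not model
the actual header encoding.

```python
def label_packets(num_packets: int, first_psn: int) -> list[tuple[int, str]]:
    """Return (packet sequence number, position) for each packet of one message."""
    labels = []
    for i in range(num_packets):
        if num_packets == 1:
            position = "only"
        elif i == 0:
            position = "first"
        elif i == num_packets - 1:
            position = "last"
        else:
            position = "intermediate"
        labels.append((first_psn + i, position))  # the PSN grows by one per packet
    return labels


labels = label_packets(num_packets=4, first_psn=100)
single = label_packets(num_packets=1, first_psn=104)
```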
[0053] If a reliable transport service is employed, when a request packet reaches its destination
end node, acknowledgment packets are used by the destination end node to let the request
packet sender know the request packet was validated and accepted at the destination.
Acknowledgment packets acknowledge one or more valid and accepted request packets. The
requester can have multiple outstanding request packets before it receives any acknowledgments.
In one embodiment, the number of multiple outstanding messages is determined when a QP is
created.

[0054] Referring to Figure 6, a schematic diagram illustrating a portion of a distributed
computer system is depicted in accordance with the present invention. The distributed computer
system 600 in Figure 6 includes host processor nodes 602 and 604. Host processor node 602
includes a HCA 606, and host processor node 604 includes a HCA 608. The distributed
computer system 600 in Figure 6 includes a SAN fabric 610 which includes a switch 612 and a
switch 614. The SAN fabric 610 in Figure 6 includes a link coupling HCA 606 to switch 612; a
link coupling switch 612 to switch 614; and a link coupling HCA 608 to switch 614.

[0055] In the example transactions, host processor node 602 includes client process A 616, and
host processor node 604 includes client process B 618. Client process A 616 interacts with HCA
hardware 606 through QP 620. Client process B 618 interacts with HCA hardware 608 through
QP 622. QP 620 and QP 622 are data structures. QP 620 includes send work queue 624 and a
receive work queue 626. QP 622 also includes send work queue 628 and receive work queue
630.

[0056] Process A 616 initiates a message request by posting WQEs to the send queue 624 of QP
620. Such a WQE is illustrated by WQE 428 in Figure 4. The message request of client process
A 616 is referenced by a gather list contained in the send WQE 428. Each data segment in the
gather list points to a virtually contiguous local memory region, which contains a part of the
message. This is indicated by data segments 4 438, 5 440, and 6 442, which respectively hold
message parts 4, 5, and 6.

[0057] Hardware in HCA 606 reads the WQE and segments the message stored in virtual
contiguous buffers into packets, such as packet 512 in Figure 5. Packets are routed through the
SAN fabric 610, and for reliable transfer services, are acknowledged by the final destination end
node, which in this case is host processor node 604. If not successively acknowledged, the
packet is re-transmitted by the source end node, host processor node 602. Packets are generated
by source end nodes and consumed by destination end nodes.

[0058] Referring now to Figure 7, the send request message is transmitted from source end node
702 to destination end node 704 as packets 1 706, 2 708, 3 710, and 4 712. Acknowledgment
packet 4 712 acknowledges that all 4 request packets were received.
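
A single acknowledgment covering several request packets can be modeled with the following
Python sketch. It assumes, purely for illustration, that packets are identified by increasing
integers and that an acknowledgment for packet n covers every packet up to and including n;
the function name apply_ack is hypothetical.

```python
def apply_ack(outstanding: list[int], acked: int) -> list[int]:
    """Drop every outstanding request packet covered by one acknowledgment."""
    return [number for number in outstanding if number > acked]


outstanding = [1, 2, 3, 4]               # four request packets in flight
after_ack = apply_ack(outstanding, 4)    # one acknowledgment covers all four
still_unacked = apply_ack(outstanding, 2)  # an acknowledgment covering only 1 and 2
```

Under these assumptions, whatever remains in still_unacked is the set of packets the source end
node would have to re-transmit if no further acknowledgment arrived.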

[0059] The message in Figure 7 is being transmitted with a reliable transport service. Switches
(and routers) that relay the request and acknowledgment packets do not generate any packets;
only the source and destination HCAs do (respectively).

REMOTE OPERATION FUNCTIONALITY

[0060] SAN 113, with its interlinked arrangement of components and sub-components, provides
a method for completing remote operations, by which processor nodes may directly control
processes in I/O nodes. Remote operation also permits the network to manage itself. A remote
direct memory access (RDMA) Read work request (WR) provides a memory semantic operation
to read a virtually contiguous memory space on a remote node. A memory space can either be a
portion of a memory region or a portion of a memory window. A memory region references a
previously registered set of virtually contiguous memory addresses defined by a virtual address
and length. A memory window references a set of virtually contiguous memory addresses which
have been bound to a previously registered region.

[0061] The RDMA Read WR writes the data to a virtually contiguous local memory space.
Similar to Send WR 403, virtual addresses used by the RDMA Read WQE to reference the local
data segments are in the address context of the process that created the local QP 301. The
remote virtual addresses are in the address context of the process owning the remote QP targeted
by the RDMA Read WQE.

[0062] RDMA Write WQE provides a memory semantic operation to write a virtually
contiguous memory space on a remote node. RDMA Write WQE contains a scatter list of local
virtually contiguous memory spaces and the virtual address of the remote memory space into
which the data from the local memory spaces is written.

[0063] RDMA FetchOp WQE provides a memory semantic operation to perform an atomic
operation on a remote word. RDMA FetchOp WQE is a combined RDMA Read, Modify, and
Write operation. RDMA FetchOp WQE can support several read-modify-write operations, such
as "Compare and Swap if Equal."
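
For illustration, the semantics of a "Compare and Swap if Equal" operation may be sketched as below. The class RemoteWord and its methods are hypothetical names, and the lock merely models the atomicity that the HCA hardware is described as providing; the sketch does not represent the actual hardware interface.

```python
import threading


class RemoteWord:
    """Stands in for a single word of memory on a remote node (hypothetical)."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()   # models atomicity of the combined operation

    def compare_and_swap_if_equal(self, expected: int, new: int) -> int:
        """Read the word, and write `new` only if the value read equals `expected`.

        The read, the comparison, and the conditional write happen as one
        indivisible step. The value read is returned to the requester.
        """
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def read(self) -> int:
        with self._lock:
            return self._value
```

A requester can tell whether the swap took effect by comparing the returned value with the expected value it supplied.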


[0064] A Bind (unbind) remote access key (R_Key) WQE provides a command to the HCA
hardware to modify a memory window by associating the memory window to a memory region.
A second command to destroy a memory window by disassociating the memory window from a
memory region is also provided. The R_Key is part of each RDMA access and is used to
validate that the remote process has permitted access to the buffer.

EFFICIENT PROCESS FOR HANDOVER BETWEEN SUBNET MANAGERS
[0065] The present invention makes use of the features described in the subject matter of Patent
Application Serial No. 09/962,354 (Attorney Docket No. AUS9-2000-0625US1)
"ASSOCIATION OF END-TO-END CONTEXT VIA RELIABLE DATAGRAM DOMAINS"
filed on October 19, 2000, the entire content of which is hereby incorporated by reference. The
referenced application allows Reliable Datagram QPs to be used for communicating across
multiple partitions. In SAN 113, QPs that support SAN Service Types are associated with a
partition and cannot communicate to QPs that are outside of the partition to which the QP is
associated. The QP is barred from communicating with QPs in another partition even if the
node's HCA port, which the QP uses, has access to different partitions. RD QPs, however, can
communicate with any given partition the node's HCA has access to, so long as there is an
underlying End-End Context (EEC) which is associated with the given partition.
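
The partition rule stated above may be expressed, purely as an illustrative sketch, by the following check. The QueuePair fields and the can_communicate function are assumed names introduced for this example, and partitions are represented simply by their P_Key values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePair:
    is_reliable_datagram: bool
    partition: int                            # P_Key the QP is associated with
    hca_partitions: frozenset = frozenset()   # partitions the node's HCA can access
    eec_partitions: frozenset = frozenset()   # partitions having an underlying EEC


def can_communicate(qp: QueuePair, target_partition: int) -> bool:
    """Return True if `qp` may communicate with a QP in `target_partition`."""
    if not qp.is_reliable_datagram:
        # Barred from other partitions even if the HCA port can access them.
        return target_partition == qp.partition
    # RD QPs: any partition the HCA can access, provided an EEC is
    # associated with that partition.
    return (target_partition in qp.hca_partitions
            and target_partition in qp.eec_partitions)
```

For example, a non-RD QP associated with partition 1 is refused partition 5 even when the HCA has access to both, whereas an RD QP on the same HCA reaches partition 5 once an EEC associated with partition 5 exists.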

[0066] The present invention provides a method for seamlessly merging (i.e., linking or
connecting) two independent subnets having separate subnet managers (SMs) into a single SAN
113 with a master SM selected from among the separate SMs. The invention utilizes time
stamps and GUID addresses on SMDB entries as an aid for a Subnet Manager to efficiently
absorb the SMDB of another subnet manager during the subnet management handover process.

[0067] Figure 8 illustrates a SAN 113 comprising two systems (subnets), each having
its own master subnet manager, which are interconnected. System 1 801 includes Host1 803,
Master Subnet Manager SM1 805, and devices 1A 809 and 1B 807, all interconnected via subnet
810. System 2 811 includes Host2 813, Master Subnet Manager SM2 815, and devices 2A 819
and 2B 817 connected to subnet 820. In the illustrative embodiment, both systems utilize the
same P_Key assignments. Device 1A 809 has a P_Key assigned with a value of 1, and device
1B 807 has a P_Key assigned with a value of 5. Device 2A 819 has a P_Key assigned with a
value of 2, and device 2B 817 has a P_Key assigned with a value of 5. Thus, the two systems are
joined together via a link 821 to form SAN 113. Devices 1B 807 and 2B 817 of different
systems are each assigned a P_Key value of 5. In a preferred embodiment, the invention
provides a method for handling the overlaps which occur in the P_Key assignments when the
systems are merged, as described below. The linking of systems may be completed via a
switch as described above and illustrated in Figure 1A.

[0068] When the two systems of Figure 8 are initially merged, one of the two master subnet
managers is relegated to a standby subnet manager, while the other SM becomes the master
subnet manager of the new merged subnet (SAN 113). Selection of the particular SM that
becomes the master SM of the merged system may be dependent on priority values associated
with each SM during their initial configuration. In a priority scheme, the SM with the highest
priority is automatically selected as the master SM. The priority value may be assigned by a
subnet administrator when the subnet is initially configured. In the event that two SMs have the
same priority value, then, in the preferred embodiment, the SM with the lowest GUID is selected
to be the master subnet manager. During the handover process, the master subnet manager of the
newly-formed merged network receives the SMDB of the other subnet manager, reconfigures the
network or network components where necessary, and takes over management of the total
merged network.
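
The selection rule described above may be sketched, by way of example only, as follows. The SubnetManager record and the select_master function are hypothetical names, and the sketch assumes that a numerically larger priority value denotes a higher priority, which the specification does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetManager:
    name: str
    priority: int   # assumption: larger number means higher priority
    guid: int


def select_master(managers: list[SubnetManager]) -> SubnetManager:
    """Pick the SM with the highest priority; break ties with the lowest GUID."""
    if not managers:
        raise ValueError("at least one subnet manager is required")
    # Sorting key: highest priority first (negated), then lowest GUID.
    return min(managers, key=lambda sm: (-sm.priority, sm.guid))
```

Every SM not returned by select_master would be relegated to standby in this sketch.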

[0069] The method utilized by the present invention to accomplish the seamless transfer of
SMDB involves timestamping each SMDB with the time at which the last change was made in
that SMDB. With this timestamp, the SM that is accepting the handover can then determine the
correct action to take in handling each entry as the SMDB(s) being taken over is being absorbed.
Referring to Figure 9A, the process of time stamping entries in SMDB is illustrated. The
process begins at block 901, and following, a determination is made whether there is a change to
a Subnet Management Database at block 903. If not, then the process loops back to 901. If there
was a change, then the entry is time stamped with the time of the last change to the entry in that
SMDB, and the process returns to block 901.
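
A minimal sketch of the time stamping of Figure 9A is given below for illustration. The SmdbEntry and Smdb names, the choice of a GUID-keyed dictionary, and the injectable clock are assumptions of this sketch rather than features recited in the specification.

```python
import time
from dataclasses import dataclass


@dataclass
class SmdbEntry:
    guid: int
    p_key: int
    timestamp: float   # time of the last change to this entry


class Smdb:
    """Hypothetical in-memory Subnet Management Database keyed by GUID."""

    def __init__(self, clock=time.time) -> None:
        self.entries: dict[int, SmdbEntry] = {}
        self._clock = clock   # injectable so the behavior can be tested

    def update(self, guid: int, p_key: int) -> SmdbEntry:
        """Record a change (block 903) and stamp the entry with the current time."""
        entry = SmdbEntry(guid, p_key, self._clock())
        self.entries[guid] = entry
        return entry
```

Because every change passes through update, each entry always carries the time of its most recent modification.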

[0070] Referring now to Figure 9B, the process begins at block 951. Next, a determination is
made at block 953 whether there is a merge of subnets occurring. If a merge of
subnets is occurring, the master subnet manager of the new subnet being formed by merging two
or more subnets begins to examine all entries of SMDBx and its own SMDBb at block 955.
As utilized herein, SMDBx refers to the database of subnets whose subnet managers are not
selected as the master subnet manager when a merge of the subnets occurs. SMDBb refers to
the database of the master subnet manager prior to merger of the databases.


[0071] Returning to Figure 9B, a determination is made at block 957 whether the master subnet
manager finds the same GUID entry in both SMDBx and SMDBb. If the same GUID entry is
found, the master subnet manager keeps the one with the latest time stamp and discards the other
at block 959. If, however, the same GUIDs are not found, the process moves to block 961.

[0072] Next, a determination is made at block 961 whether all GUIDs have been examined in
SMDBx. If not all the GUIDs have been examined, then the process returns to block 957;
however, if all the GUIDs have been examined, the process moves to block 963, where a
determination is made whether SMDBx has the same P_Key entries for different GUIDs as in
SMDBb. If yes, then the master subnet manager will change all occurrences of that P_Key in
SMDBx to a new unique P_Key value, which is different than that in either SMDBx or SMDBb,
at block 965. The process then moves to block 967. If the P_Key entries are not the same for
SMDBx and SMDBb, then a determination is made at block 967 whether all entries in SMDBx
have been examined. If all the entries have not been examined, the process returns to block 963.
If all entries have been examined in SMDBx, then the SMDBx is merged with SMDBb at block
969. Next, a determination is made at block 971 whether there is another database to be merged
with the master subnet manager's SMDBb, and if so, the process returns to block 955. Otherwise,
the process ends at block 973.
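
The flow of Figure 9B for a single SMDBx may be sketched as follows, for illustration only. The Entry record, the merge_smdb function, and the helper _fresh_p_key are hypothetical. The sketch assumes that each database maps a GUID to one entry, that the P_Key check applies to entries whose GUID appears only in SMDBx, and that a new P_Key is chosen as the smallest positive integer unused in either database; none of these choices is dictated by the specification.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    guid: int
    p_key: int
    timestamp: float


def _fresh_p_key(used: set) -> int:
    """Assumed policy: smallest positive value not used in either database."""
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


def merge_smdb(smdb_b: dict, smdb_x: dict) -> dict:
    """Absorb SMDBx into the master's SMDBb and return the merged database."""
    merged = dict(smdb_b)
    incoming = {}
    # Blocks 957-961: for a GUID present in both, keep the latest time stamp.
    for guid, entry in smdb_x.items():
        mine = smdb_b.get(guid)
        if mine is None:
            incoming[guid] = entry
        elif entry.timestamp > mine.timestamp:
            merged[guid] = entry
    # Blocks 963-967: a P_Key used in SMDBb for different GUIDs is replaced,
    # consistently, by a value unused in either SMDBx or SMDBb.
    used = {e.p_key for e in smdb_b.values()} | {e.p_key for e in smdb_x.values()}
    master_keys = {e.p_key for e in smdb_b.values()}
    remap = {}
    for guid, entry in incoming.items():
        if entry.p_key in master_keys:
            if entry.p_key not in remap:
                remap[entry.p_key] = _fresh_p_key(used)
                used.add(remap[entry.p_key])
            entry = replace(entry, p_key=remap[entry.p_key])
        merged[guid] = entry   # block 969: merge into SMDBb
    return merged
```

Applied to the P_Key values of Figure 8, device 2B 817 would receive a new P_Key while device 1B 807 keeps the value 5. When several databases are to be absorbed (block 971), merge_smdb could be called once per SMDBx with the previous result as SMDBb.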

[0073] There may be other overlapping information in the respective databases that may need to
be merged as described for the P_Keys and GUIDs above. These would follow a similar process
as described in Figure 9B.

[0074] In the above detailed description of the preferred embodiments, reference is made to the
accompanying drawings which form a part hereof, and in which is shown, by way of illustration,
specific embodiments in which the invention may be practiced. It is to be understood that other
embodiments may be utilized and structural or logical changes may be made without departing
from the scope of the present invention. For example, although the invention is described with
reference to multiple computing nodes accessing a single database, the invention is applicable to
all other transactions occurring on the network. The above detailed description, therefore, is not
to be taken in a limiting sense, and the scope of the present invention is defined by the appended
claims.

[0075] As a final matter, it is important to note that while an illustrative embodiment of the
present invention has been, and will continue to be, described in the context of a fully functional
data processing system, those skilled in the art will appreciate that the software aspects of an
illustrative embodiment of the present invention are capable of being distributed as a program
product in a variety of forms, and that an illustrative embodiment of the present invention applies
equally regardless of the particular type of signal bearing media used to actually carry out the
distribution. Examples of signal bearing media include recordable type media such as floppy
disks, hard disk drives, CD ROMs, and transmission type media such as digital and analogue
communication links.